---
title: Exploratory Data Analysis
keywords: fastai
sidebar: home_sidebar
summary: "Analysis of classification data"
description: "Analysis of classification data"
---
img_files = get_files(path, extensions=image_extensions, recurse=True)
img_files[0]
Let's start by grabbing some of the images from our unorganized dataset and taking a look at them.
import random

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np

def show_images(images, cols=1, titles=None):
    """Display a list of images in a single figure with matplotlib.

    images: list of image file paths (anything mpimg.imread accepts).
    cols (default 1): number of columns in the figure (the number of
        rows is set to ceil(n_images / cols)).
    """
    assert titles is None or len(images) == len(titles)
    n_images = len(images)
    if titles is None:
        titles = ['Image (%d)' % i for i in range(1, n_images + 1)]
    fig = plt.figure()
    rows = int(np.ceil(n_images / float(cols)))
    for n, (i, title) in enumerate(zip(images, titles)):
        image = mpimg.imread(i)
        a = fig.add_subplot(rows, cols, n + 1)
        if image.ndim == 2:  # grayscale images get a gray colormap
            plt.gray()
        plt.imshow(image)
        a.set_title(title)
    fig.set_size_inches(np.array(fig.get_size_inches()) * n_images)
    plt.show()
show_images(random.choices(img_files, k=6), 3)
show_images(random.choices(img_files, k=6), 3)
Our goal here is to classify these images by the presence or absence of certain types of deer.
Looking at these random selections, it's clear that while some features are consistent, like the feeder structures, there is a good deal of variety in how the deer appear.
A few images appear to have no deer at all, and some show only a quarter or half of an animal. Some animals are close to the camera while others are in the background. Some shots are clearly taken during the day and others at night.
In short, these images have large variations in the size, rotation, and position of the animals (our objects of interest).
Now we'll look at some individual files and see what metadata we can collect that might be useful.
from PIL import Image, ExifTags

night = Image.open(img_files[466]); night
day = Image.open(img_files[1]); day
day_exif = {ExifTags.TAGS[k]: v for k, v in day._getexif().items() if k in ExifTags.TAGS}
night_exif = {ExifTags.TAGS[k]: v for k, v in night._getexif().items() if k in ExifTags.TAGS}
day_exif.keys()
for k in day_exif.keys():
    if k != 'MakerNote':  # MakerNote is a long, camera-specific binary blob
        print(f'{k} :', f'{day_exif[k]}'.rjust(45 - len(k)), f'{night_exif[k]}'.rjust(35))
The SceneCaptureType tag indicates the type of scene that was shot (in standard EXIF, 0 = standard and 3 = night scene).
This will let us quickly and easily sort the images into night and day.
The images shot in night mode turn out to have 3 channels, just like the full-color day images.
night_a = np.array(night)
night_a.shape
Though there are three channels, they appear to be nearly identical.
np.linalg.norm(night_a[:,:,0]), np.linalg.norm(night_a[:,:,1]), np.linalg.norm(night_a[:,:,2])
This is not the case with the color images.
day_a = np.array(day)
np.linalg.norm(day_a[:, :, 0]), np.linalg.norm(day_a[:, :, 1]), np.linalg.norm(day_a[:, :, 2])
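The norms above hint at it, but the claim that the night channels are nearly identical can be checked directly. A minimal sketch, where the `channel_spread` helper and the synthetic arrays are illustrative stand-ins rather than part of the notebook; applied to `night_a` it should return a value near 0, and a much larger value for `day_a`:

```python
import numpy as np

def channel_spread(img):
    """Max absolute difference between any pair of channels in an HxWx3 array."""
    img = img.astype(np.int16)  # avoid uint8 wraparound when subtracting
    return int(max(
        np.abs(img[:, :, 0] - img[:, :, 1]).max(),
        np.abs(img[:, :, 1] - img[:, :, 2]).max(),
        np.abs(img[:, :, 0] - img[:, :, 2]).max(),
    ))

# Synthetic stand-ins: a grayscale frame replicated across three channels
# (like the night shots) versus a frame with one channel shifted (like a day shot).
gray = np.arange(16, dtype=np.uint8).reshape(4, 4)
fake_night = np.stack([gray, gray, gray], axis=-1)
fake_day = fake_night.copy()
fake_day[:, :, 2] = fake_day[:, :, 2] + 100  # shift one channel

print(channel_spread(fake_night), channel_spread(fake_day))  # 0 100
```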
We can render each channel as a single-channel image to confirm that they are basically identical except for the logo in the bottom right.
Image.fromarray(night_a[:, :, 0], mode='L')
Image.fromarray(night_a[:, :, 1], mode='L')
Image.fromarray(night_a[:, :, 2], mode='L')
To get an idea of the intensity distributions of night vs day let's create histograms.
We'll bin the image pixels into 256 bins and plot them along the x axis.
import cv2

night = cv2.imread(img_files[466].as_posix())
color = ('b', 'g', 'r')
for i, col in enumerate(color):
    histr = cv2.calcHist([night], [i], None, [256], [0, 256])
    plt.plot(histr, color=col)
plt.xlim([-10, 270])
plt.title('Histogram for Night')
plt.show()
This histogram confirms what we observed above: the three channels of the night image are nearly identical.
The majority of the pixels are black (#000000),
and a significant number fall in a grey band that peaks around 25.
At the very end of the range we can see a cluster of white pixels.
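Those peak locations can also be read off numerically rather than eyeballed from the plot. A sketch with an illustrative `histogram_peaks` helper and a synthetic channel standing in for the real one (on the actual image you would pass something like `night[:, :, 0]`):

```python
import numpy as np

def histogram_peaks(channel, k=3):
    """Return the k most common pixel intensities in a uint8 channel, most common first."""
    counts = np.bincount(channel.ravel(), minlength=256)
    return [int(v) for v in np.argsort(counts)[::-1][:k]]

# Synthetic stand-in for a night-image channel: mostly black, a grey band
# peaking at 25, and a small cluster of white pixels.
channel = np.concatenate([
    np.zeros(500, dtype=np.uint8),     # black
    np.full(300, 25, dtype=np.uint8),  # grey
    np.full(50, 255, dtype=np.uint8),  # white
])
print(histogram_peaks(channel))  # [0, 25, 255]
```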
day = cv2.imread(img_files[0].as_posix())
color = ('b', 'g', 'r')
for i, col in enumerate(color):
    histr = cv2.calcHist([day], [i], None, [256], [0, 256])
    plt.plot(histr, color=col)
plt.xlim([-10, 270])
plt.title('Histogram for Day')
plt.show()
The tonal range of the day image gives us less information.
Based on the analysis above we'll assume we need at least two distinct sets of models: one for night and one for day.
We can start the sorting process by using the metadata to separate the images into night and day.
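A minimal sketch of that split. The `split_night_day` helper is illustrative, not part of the notebook; `0xA406` is the standard EXIF tag id for SceneCaptureType, and 3 its night-scene value:

```python
SCENE_CAPTURE_TYPE = 0xA406  # standard EXIF tag id for SceneCaptureType
NIGHT_SCENE = 3              # 0 = standard, 1 = landscape, 2 = portrait, 3 = night

def split_night_day(images):
    """Partition opened PIL images into (night, day) lists via SceneCaptureType."""
    night, day = [], []
    for img in images:
        exif = img._getexif() or {}  # _getexif() returns None when no EXIF data
        (night if exif.get(SCENE_CAPTURE_TYPE) == NIGHT_SCENE else day).append(img)
    return night, day
```

Something like `split_night_day(Image.open(f) for f in img_files)` would then partition the whole dataset, with images lacking the tag defaulting to the day bucket.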